These notes are based on Sections 7.4–7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.
In principle, rejection sampling can work in a high-dimensional problem if we can find a PDF \(g\) from which we can efficiently sample and a constant \(C\) that makes \(Cg(x)\) a tight envelope. It turns out this is a big IF. In 1-D cases, by looking at the graph of the target \(f(x)\), we can often design a reasonable \(Cg(x)\). Even if the envelope is not very tight, we can wait a while and obtain enough samples from \(\text{U}(0,Cg(x))\) that fall below \(f(x)\). This is unlikely to be the case in high-dimensional problems. Our visual intuition in a 4-D or higher space is poor (at least mine is). In addition, the rejection rate typically grows exponentially with the dimension, and we may not be able to wait for hours or days.
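As a toy illustration of the exponential blow-up, consider a hypothetical rejection sampler (not from the text) that proposes uniformly on the cube \([-1,1]^d\) and accepts only points inside the unit ball:

```r
set.seed(1)
# Monte Carlo estimate of the acceptance rate of a rejection sampler
# that proposes uniformly on [-1,1]^d and accepts points in the unit ball.
accept_rate <- function(d, n = 1e5) {
  x <- matrix(runif(n * d, -1, 1), n, d)  # n proposals in d dimensions
  mean(rowSums(x^2) <= 1)                 # fraction of accepted proposals
}
rates <- sapply(c(2, 5, 10), accept_rate)
rates  # the acceptance rate shrinks rapidly with d
```

For \(d=2\) the rate is about \(\pi/4\approx79\%\), but for \(d=10\) it falls well below 1%, so almost all the work is wasted on rejected proposals.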
Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate dependent samples from approximate distributions that increasingly resemble the target distribution. A sequence of samples are generated by simulating a Markov chain, which is designed to have the limiting distribution equal to the target distribution. By being content with dependent samples from approximate distributions, MCMC enables us to handle functions on high-dimensional spaces, which are common in modern complex problems.
Since Monte Carlo simply refers to stochastic simulation, we might just call it Markov chain simulation. But, MCMC has sort of become the standard name everybody using statistical computing understands, perhaps because the acronym is just catchy.
A Markov chain is a sequence of random variables \(X_1,X_2\dots\) (i.e., a stochastic process) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]
In an IID sequence (e.g., Bernoulli process), futures are independent even of the present.
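For instance, a Gaussian random walk is a simple Markov chain: the next value is drawn from a normal distribution centred at the present value, regardless of the earlier history. A minimal sketch (not from the text):

```r
set.seed(1)
n <- 1000
x <- numeric(n)
for (i in 2:n) {
  # X_i | X_{i-1} ~ N(X_{i-1}, 1): the past beyond X_{i-1} is irrelevant
  x[i] <- x[i-1] + rnorm(1)
}
head(x)
```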
If each \(X_i\) is a discrete random variable, \[P(X_{i+1}=x_{i+1}|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}=x_{i+1}|X_i=x_i).\] In particular, if \(x_{i+1}\) is an element of a finite set that is common for all \(i\geq0\), then the Markov chain can be represented by the transition matrix, which we denote by \(Q\). That is, for all \(i\geq0\), \[Q(j,k) = P(X_{i+1}=k|X_i=j) = P(X_1=k|X_0=j),\] where \(j,k\) are elements of the finite set, which is usually called the state space. For instance, if we start with \(X_0=j\), the probability of \(X_2=k\) is \[\begin{align} P(X_2=k|X_0=j) &= \sum_{s} P(X_2=k,X_1=s|X_0=j)\\ &= \sum_{s} P(X_2=k|X_1=s)\cdot P(X_1=s|X_0=j)\\ &= \sum_{s} Q(j,s)\cdot Q(s,k)\\ &= Q^2(j,k), \end{align}\] where \(s\) runs over the state space.
To see the last equality, for fixed \(j\) and \(k\), take the \(j\)th row and the \(k\)th column of \(Q\) and compute their dot-product, which is the \((j,k)\) element of \(Q^2=Q\cdot Q\).
As a concrete example, consider a Markov chain with state space \(\{1,2,3,4,5,6\}\) and the following transition matrix.
Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.2 0.0 0.3 0.5 0.0
## [2,] 0.5 0.0 0.5 0.0 0.0 0.0
## [3,] 0.3 0.0 0.0 0.7 0.0 0.0
## [4,] 0.0 0.0 0.0 0.0 0.0 1.0
## [5,] 0.0 0.0 0.0 0.8 0.0 0.2
## [6,] 0.0 0.0 0.1 0.0 0.9 0.0
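For reproducibility, the transition matrix printed above can be entered as follows.

```r
# Transition matrix of the 6-state chain; each row is a conditional
# distribution P(X_{i+1} = . | X_i = j), so each row sums to 1.
Q <- matrix(c(0.0, 0.2, 0.0, 0.3, 0.5, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0, 0.0,
              0.3, 0.0, 0.0, 0.7, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
              0.0, 0.0, 0.0, 0.8, 0.0, 0.2,
              0.0, 0.0, 0.1, 0.0, 0.9, 0.0),
            nrow = 6, byrow = TRUE)
rowSums(Q)  # sanity check: all ones
```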
For example, from \(Q^2\), you can read off the probability of \(X_2=5\) given \(X_0=3\) as 0.15.
Q%*%Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.10 0.00 0.10 0.40 0.00 0.40
## [2,] 0.15 0.10 0.00 0.50 0.25 0.00
## [3,] 0.00 0.06 0.00 0.09 0.15 0.70
## [4,] 0.00 0.00 0.10 0.00 0.90 0.00
## [5,] 0.00 0.00 0.02 0.00 0.18 0.80
## [6,] 0.03 0.00 0.00 0.79 0.00 0.18
In general, the probability of \(X_n=k\) given \(X_0=j\) is \(Q^n(j,k)\).
For most MCMC applications, we are interested in continuous random variables, i.e., the state space is uncountable. In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]
In particular, we assume the conditional PDF \(f_{X_{i+1}|X_i=x}(y|x)\) does not depend on \(i\) and denote it by \(q(y|x)\). That is, for all \(i\geq0\), \[q(y|x) = f_{X_{i+1}|X_i=x}(y|x)=f_{X_1|X_0=x}(y|x),\] where \(x,y\in\mathbb{R}^d\) are states. We call \(q\) the transition density of a (continuous-state) Markov chain.
While in general each of \(X_0,X_1,\dots\) follows a distinct marginal distribution, there is a special common distribution. A state distribution (i.e., marginal distribution) \(f\) is said to be stationary if \(f\) satisfies the following global balance equations: \[f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i}.\]
To see how special the global balance is, recall that we always have \[f_{i+1}(x_{i+1}) = \int f_{i}(x_{i})q(x_{i+1}|x_{i})dx_{i},\] by marginalisation, i.e., integrating out \(x_i\) from the joint PDF. In general, two marginal PDFs \(f_{i+1}\) and \(f_{i}\) need not be the same. However, if they happen to be identical under the transition \(q\), then \(f_{i+1} = f_{i} = f\) is a stationary distribution in the Markov chain characterised by \(q\).
Keep in mind that it is a set of equations, as they hold for all \(x_{i+1}\). This is easy to see in a finite-state case: \[p_{i+1} = p_{i}Q,\] where \(p_i\) and \(p_{i+1}\) are state distributions represented as row vectors. That is, each entry of \(p_{i+1}\) is the result of one marginalisation, and they are all computed at once by the matrix multiplication with \(Q\). Naturally, a state distribution \(p\) is stationary if \[p = pQ.\]
Given initial \(x_0\), using \(q\), we can simulate a Markov chain, i.e., sequentially obtain samples \(x_1,x_2,\dots,x_n\). In other words, we get samples from the following joint PDF \[f(x_1,\dots,x_n|x_0) = q(x_n|x_{n-1})q(x_{n-1}|x_{n-2}) \cdots q(x_1|x_0).\] The rationale for Markov chains as a method for sampling from desired target distributions is the following.
If the chain converges to the stationary distribution \(f\), then after some burn-in period \(I\) each marginal distribution is approximately \(f\), and the samples beyond \(I\) collectively behave almost as if drawn from \(f\): \[\begin{align} f(x_{I+1},x_{I+2},\dots,x_n|x_{I}) &= q(x_{n}|x_{n-1}) \cdots q(x_{I+2}|x_{I+1})q(x_{I+1}|x_{I})\\ &\approx f(x_{I+1})f(x_{I+2})\cdots f(x_n). \end{align}\]
A common confusion. Be careful what is converging and from what distribution each sample is taken. When it comes to Markov chains, generally, we have no way to directly sample from the marginal distribution \(f_{i+1}\). When simulating a Markov chain, we take each sample \(x_{i+1}\) from the conditional distribution \(q(x_{i+1}|x_i)\), not from the marginal \(f_{i+1}\). The transition \(q\) defines the Markov chain and remains unchanged no matter how long we simulate the chain. So, the convergence of marginal distributions does not mean each sample is independently taken from the stationary distribution \(f\). Hence, the collective resemblance.
It is easy to demonstrate the convergence for a finite-state case. For instance, for the \(Q\) defined above, the stationary distribution \(p\) that satisfies \[p = pQ\] is an eigenvector of \(Q^T\) for the eigenvalue \(1\), normalised to sum to one.
p <- Re(eigen(t(Q))$vectors[,1]) # eigenvector for eigenvalue = 1
p <- p/sum(p) # normalisation
p
## [1] 0.011978917 0.002395783 0.035936751 0.283660757 0.318639195 0.347388596
We can see that \(Q^{500}\) has identical rows, each of which is almost equal to the stationary distribution.
n <- 500
library(expm) # provides the matrix power operator %^%
Qn <- Q%^%n
Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [2,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [3,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [4,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [5,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [6,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
What does this imply? For example, suppose \(X_0=4\) and \(p_0 = (0,0,0,1,0,0)\). Then,
p0 <- c(0,0,0,1,0,0)
p0%*%Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
In general,
p0 <- runif(6)
p0 <- p0/sum(p0) # random initial state distribution
p0%*%Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
No matter which state we start in, we end up with the same state distribution from time 500 onwards. Accordingly, an MCMC estimate of this limiting distribution is
n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:6, 1) # random initial state
for (i in 2:n) {
x[i] <- sample(1:6, 1, prob=Q[x[i-1],]) # MCMC sampling
}
table(x[501:n])/(n-500) # 500 burn-in periods
##
## 1 2 3 4 5 6
## 0.012044022 0.002474237 0.036213107 0.283649825 0.318318159 0.347300650
In contrast, if we sample from the true stationary distribution, an MC estimate is
x <- sample(1:6, n-500, replace=T, prob=p)
table(x)/(n-500)
## x
## 1 2 3 4 5 6
## 0.01195398 0.00245923 0.03596598 0.28437719 0.31852526 0.34671836
The task in MCMC applications is to design a good Markov chain, equivalently, to design a good transition \(q\). Key properties of a good \(q\) are:
- the stationary distribution of the chain is the target \(f\);
- the chain converges to the stationary distribution regardless of the initial state;
- the convergence is fast and the samples mix well.
Let’s discuss the second condition, convergence, while we will come back to the other two later.
Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.
Irreducibility ensures that every state can be reached from every other state in a finite number of steps. This is important because, in MCMC, samples are generated sequentially according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\).
Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.
See Roberts & Rosenthal (2004) for more details.
Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations in order to ensure that the chain converges to our target \(f\). In practice, however, this is tricky and challenging because of the integral in the definition of the global balance equations. So, we typically impose a stronger condition on \(q\) called the detailed balance equations: \[f(x)q(y|x) = f(y)q(x|y), \;\forall x,y.\]
By integrating both sides over \(x\) (or \(y\)), we see the detailed balance equations imply the global balance equations.
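Explicitly, integrating both sides over \(x\) and using \(\int q(x|y)dx = 1\),
\[\int f(x)q(y|x)dx = \int f(y)q(x|y)dx = f(y)\int q(x|y)dx = f(y),\]
which is exactly the global balance equation for \(f\).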
Here is an example of MCMC for sampling from a bivariate normal distribution \(\text{N}(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{bmatrix}1 & 0.7\\0.7 & 1\\\end{bmatrix} .\]
Its PDF looks like this:
Using a Gibbs sampler, the result may look like the following.
set.seed(123)
n <- 300
m <- 3
# Initialisation
r <- 0.7 # target correlation, used in the conditional distributions below
x <- -1
xx <- rep(x, n)
y <- 2
yy <- rep(y, n)
# Run
for (i in 2:n) {
x <- rnorm(1, mean=r*y, sd=sqrt(1-r^2))
xx[i] <- x
y <- rnorm(1, mean=r*x, sd=sqrt(1-r^2))
yy[i] <- y
par(mar=c(5,5,.1,.1))
plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
if (i <= m) {
segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
} else {
points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4)
segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
}
arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}
Pay attention to the dependency of each new point on the previous one; each arrow goes only so far. The points are not IID samples from the target distribution. But, collectively, they are distributed as if each were sampled from the target distribution.
Another example, a mixture of two clusters, gives an even clearer picture of the dependency. The two clusters are \(\text{N}(\mu_0,\Sigma)\) and \(\text{N}(\mu_1,\Sigma)\), where \[ \mu_0=\begin{bmatrix}-1.7\\-1.7\end{bmatrix},\; \mu_1=\begin{bmatrix}1.7\\1.7\end{bmatrix}, \;\text{and}\; \Sigma=\begin{bmatrix}1 & 0\\0 & 1\\\end{bmatrix},\] and the cluster assignment \(C\) follows \(\text{Bernoulli}(0.5)\). In other words, the joint PDF is \[f(x_1,x_2,c) = \frac{1}{2}\cdot\frac{1}{2\pi}\exp\left(-\frac{1}{2}(x-\mu_c)^T(x-\mu_c)\right),\] where \(x=(x_1,x_2)^T\).
set.seed(123)
n <- 300
m <- 3
burn_in <- 5000
C <- c(-1.7,1.7)
# Unnormalised component density; normalising constants cancel in the ratios below
f <- function(x, y, mu) exp(-((x-mu)^2 + (y-mu)^2)/2)
# Initialisation
x <- 0
y <- 0
c <- C[1]
# Run for the burn-in
for (i in 2:burn_in) {
x <- rnorm(1, c)
y <- rnorm(1, c)
k <- f(x,y,C[1]) + f(x,y,C[2])
c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k))
}
# Run for the plot
xx <- rep(x,n)
yy <- rep(y,n)
cc <- rep(c,n)
for (i in 2:n) {
x <- rnorm(1, c)
xx[i] <- x
y <- rnorm(1, c)
yy[i] <- y
k <- f(x,y,C[1]) + f(x,y,C[2])
c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k))
cc[i] <- c
par(mar=c(5,5,.1,.1))
plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-4,4), ylim=c(-4,4),
pch=20, cex=.4, col=cc[1]/C[2]+3)
if (i <= m) {
segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
} else {
points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4, col=cc[2:(i-(m-1))]/C[2]+3)
segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
}
arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}
As you can see, most samples are followed by samples in the same cluster. If the samples were independent, the arrows would jump between the clusters almost every other time.
Since convergence requires more time in this example than in the first one, the first 5000 samples are discarded as burn-in.
A takeaway message at this point is that, by running MCMC for a long time, we can collect a set of samples as if they were drawn from the target PDF. Since MC estimates often require many samples anyway, simulating a long chain may not seem like an issue. However, the chain may be unnecessarily long, collecting many more samples than required for a good MC estimate. In practice, we prefer 1,000 good samples to 1,000,000 good samples when 1,000 are good enough.
In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven design templates. The Metropolis-Hastings (MH) algorithm provides a good template. Suppose we have an ergodic Markov chain characterised by a transition density \(q\). MH tells us how to build upon this Markov chain and design another one that satisfies the detailed balance equations with respect to the target \(f\).
The ergodicity of the new chain is implied by the ergodicity of the original Markov chain with \(q\).
The idea is similar to rejection sampling. Given the current state \(x_i\), we propose \(y \sim q(\cdot|x_i)\) and accept it with probability \(\alpha(x_i,y)\), setting \(x_{i+1}=y\) if accepted and \(x_{i+1}=x_i\) otherwise.
The acceptance probability is defined as follows: \[\alpha(x,y) = \min\Bigg(1,\; \frac{f(y)q(x|y)}{f(x)q(y|x)}\Bigg).\]
In the special case of a symmetric transition density, i.e., \[q(y|x) = q(x|y) \;\forall x,y,\] the method is referred to as the Metropolis algorithm, with the acceptance probability \[\alpha(x,y) = \min\Bigg(1,\; \frac{f(y)}{f(x)}\Bigg).\] A naïve yet widely used choice in practice is a normal distribution centred at \(x\): \[q(y|x) \propto \exp\left(-\frac{1}{2\sigma^2}(y-x)^T(y-x)\right),\] where \(\sigma^2\) is the common variance across all components of \(y\). Hence the symmetry.
An intuition is that, when moving to a state of higher density, we always accept the proposal \(y\). Otherwise, we make a probabilistic decision according to the ratio \(f(y)/f(x)\).
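As a minimal sketch (not from the text), here is a random-walk Metropolis sampler for a 1-D target known only up to a constant; the target `f` and the step size `sigma` below are illustrative choices:

```r
set.seed(1)
f <- function(x) exp(-x^2/2)  # unnormalised target (standard normal here)
sigma <- 1                    # proposal standard deviation: a tuning parameter
n <- 1e5
x <- numeric(n)
for (i in 2:n) {
  y <- rnorm(1, mean = x[i-1], sd = sigma)  # symmetric (normal) proposal
  alpha <- min(1, f(y)/f(x[i-1]))           # Metropolis acceptance probability
  x[i] <- if (runif(1) < alpha) y else x[i-1]
}
c(mean(x), var(x))  # should be near 0 and 1 for this target
```

Note that only the ratio \(f(y)/f(x)\) is needed, so the normalising constant of the target never has to be computed.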
Given the acceptance probability defined above, the new transition density for \(x_{i+1}\neq x_i\) is \[\tilde{q}(x_{i+1}|x_i) = q(x_{i+1}|x_i)\alpha(x_i,x_{i+1}) .\] Less importantly, the probability of not moving is \[P(X_{i+1}=x_i|X_i=x_i) = 1 - \int_{y\neq x_i} q(y|x_i)\alpha(x_i,y)dy .\]
Let’s make sure that \(\tilde{q}\) satisfies the detailed balance equations: \[f(x)\tilde{q}(y|x) = f(y)\tilde{q}(x|y), \;\forall x,y.\] First, notice that, regardless of \(\tilde{q}\), it trivially holds for \(y=x\). This is why \(X_{i+1}=x_i\) is unimportant. Next, for \(y\neq x\), \[\begin{align} f(x)\tilde{q}(y|x) &= f(x)q(y|x)\alpha(x,y)\\ &= f(x)q(y|x)\min\Bigg(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\Bigg)\\ &= \min\big(f(x)q(y|x),\; f(y)q(x|y)\big)\\ &= \min\big(f(y)q(x|y),\; f(x)q(y|x)\big)\\ &= f(y)\tilde{q}(x|y). \end{align}\]
Gibbs sampling (a.k.a. alternating conditional sampling) is a special case of MH and is widely used for multidimensional problems. It is particularly suitable for Bayesian hierarchical models due to their modularised construction. Suppose \(x\) is a vector in a multidimensional space, divided into \(d\) sub-vectors \(x^{(1)},x^{(2)},\dots,x^{(d)}\).
n.b. Each sub-vector need not be one-dimensional.
Let \(f^{(k)}\) be the conditional PDF of \(x^{(k)}\) \((k\in\{1,\dots,d\})\) given the values of the other sub-vectors: \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] Gibbs sampler assumes we can sample from each of these conditional PDFs, and proceeds as follows. Given \(x_i\), draw \[\begin{align} x_{i+1}^{(1)} &\sim f^{(1)}(\cdot|x_i^{(2)},\dots,x_i^{(d)})\\ x_{i+1}^{(2)} &\sim f^{(2)}(\cdot|x_{i+1}^{(1)},x_i^{(3)},\dots,x_i^{(d)})\\ &\;\;\vdots\\ x_{i+1}^{(d)} &\sim f^{(d)}(\cdot|x_{i+1}^{(1)},\dots,x_{i+1}^{(d-1)}). \end{align}\]
You get the idea. \(x_{i+1} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(d)})\) is technically a proposal in the context of MH, but \(x_{i+1}\) is always accepted in Gibbs sampling.
Go back to the demos above and see whether the code makes sense.
To see Gibbs sampling as a special case of MH, place \(x_{i+1}^{(k)}\) in a full-length vector \(x\) and index it as \(j+1\): \[\begin{align} \vdots&\\ x_{j} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ x_{j+1} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i+1}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ \vdots& \end{align}\] In other words, when the state changes from \(x_j\) to \(x_{j+1}\) in the Markov chain, only one of \(d\) sub-vectors \(x^{(1)},x^{(2)},\dots,x^{(d)}\) changes. The transition density \(q(x_{j+1}|x_j)\) is \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)}). \end{align}\] for \(x_{j+1}\) with \(x^{(-k)} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\) unchanged, and 0 otherwise.
If the sloppiness bothers you, imagine the densities over the entire \(x\) are concentrated on the slice at \(x^{(-k)}\).
Note that the transition densities of \(x_j\to x_{j+1}\) and \(x_{j+1}\to x_j\) are both given by the conditional PDF \(f^{(k)}\): \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\\ q(x_j|x_{j+1}) &= f^{(k)}(x_{i}^{(k)}|x^{(-k)}). \end{align}\] Now, it follows that \[\begin{align} \alpha(x_{j},x_{j+1}) &= \min\Bigg(1,\; \frac{f(x_{j+1})\cdot q(x_{j}|x_{j+1})}{f(x_{j})\cdot q(x_{j+1}|x_{j})}\Bigg)\\ &= \min\Bigg(1,\; \frac{f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i}^{(k)}|x^{(-k)})}{f^{(k)}(x_{i}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})}\Bigg)\\ &= \min(1, 1)\\ &= 1 \end{align}\] Hence, a proposal is always accepted.
It is straightforward to recognise that the target \(f\) is the stationary distribution under Gibbs sampling. First, the density of \(x^{(-k)} = (x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)})\) remains invariant because no sampling occurs for \(x^{(-k)}\). Next, by definition, the Gibbs sampler draws a sample of \(x^{(k)}\) from the conditional PDF \(f^{(k)}(x^{(k)}|x^{(-k)})\), which is derived from the target PDF \(f\). Thus, the PDF of the state distribution both before and after the transition is identically \[f^{(k)}(x^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})= f(x).\] Hence, \(f\) is the stationary distribution under Gibbs sampling.
n.b. This does not mean that no burn-in period is necessary for Gibbs sampling, because \(x^{(-k)}\) has to be distributed according to \(f^{(-k)}\), which requires convergence.
Imagine you try to model the number of customers at each of two bars in Melbourne using a Poisson distribution, namely \(\text{pois}(\lambda_1)\) for Bar 1 and \(\text{pois}(\lambda_2)\) for Bar 2. Based on other information, you think \(\lambda_1\) and \(\lambda_2\) are distinct but somehow related. So, you assume both \(\lambda_1\) and \(\lambda_2\) are distributed as \(\text{exp}(\theta)\). Finally, you complete the model by simply assuming \(\theta \sim \text{exp}(1)\).
Suppose you are going to collect \(m_1\) samples from Bar 1 and \(m_2\) samples from Bar 2. Then, your model is:
\[\begin{gather} X_1, X_2, \dots, X_{m_1} \sim \text{pois}(\lambda_1)\\ X_{m_1+1}, X_{m_1+2}, \dots, X_{m_1+m_2} \sim \text{pois}(\lambda_2)\\ \lambda_1, \lambda_2 \sim \text{exp}(\theta)\\ \theta \sim \text{exp}(1) \end{gather}\] (As usual, the observations are independent. In addition, \(\lambda_1\) and \(\lambda_2\) are independent given \(\theta\).)
You naturally expect to learn something about the average numbers of customers at Bar 1 and Bar 2 (\(\lambda_1\) and \(\lambda_2\)) and how they are related.
Given the collected data \[\mathbf{x}=(\mathbf{x}_1,\mathbf{x}_2)=(x_1,\dots,x_{m_1},x_{m_1+1},\dots,x_{m_1+m_2}),\] you derive \(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\), a joint PDF of \((\lambda_1, \lambda_2, \theta)\) conditional on \(\mathbf{x}\).
\[\begin{align} f(\lambda_1,\lambda_2,\theta|\mathbf{x}) &= \frac{f(\mathbf{x},\lambda_1,\lambda_2,\theta)}{f(\mathbf{x})}\\ &\propto f(\mathbf{x},\lambda_1,\lambda_2,\theta)\\ &= f(\mathbf{x}|\lambda_1,\lambda_2,\theta)\cdot f(\lambda_1,\lambda_2,\theta)\\ &= f((\mathbf{x}_1,\mathbf{x}_2)|\lambda_1,\lambda_2)\cdot f(\lambda_1,\lambda_2|\theta)\cdot f(\theta)\\ &= f(\mathbf{x}_1|\lambda_1)\cdot f(\mathbf{x}_2|\lambda_2)\cdot f(\lambda_1|\theta)\cdot f(\lambda_2|\theta)\cdot f(\theta)\\ &= \prod_{i=1}^{m_1}f(x_i|\lambda_1) \cdot \prod_{j=1}^{m_2}f(x_j|\lambda_2) \cdot f(\lambda_1|\theta) \cdot f(\lambda_2|\theta) \cdot f(\theta)\\ &= \prod_{i=1}^{m_1}\frac{e^{-\lambda_1}\lambda_1^{x_i}}{x_i!}\cdot\prod_{j=1}^{m_2}\frac{e^{-\lambda_2}\lambda_2^{x_j}}{x_j!}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &\propto e^{-m_1\lambda_1}\lambda_1^{\sum_{i=1}^{m_1}x_i}\cdot e^{-m_2\lambda_2}\lambda_2^{\sum_{j=1}^{m_2}x_j}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &= \theta^2 \cdot e^{-\theta} \cdot e^{-(\theta+m_1)\lambda_1}\cdot e^{-(\theta+m_2)\lambda_2} \cdot \lambda_1^{\sum_{i=1}^{m_1}x_i} \cdot \lambda_2^{\sum_{j=1}^{m_2}x_j} \end{align}\]
(Here, with a slight abuse of notation, \(x_j\) for \(j=1,\dots,m_2\) denotes the Bar 2 observations \(x_{m_1+1},\dots,x_{m_1+m_2}\).) \(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\) does not look like any familiar PDF.
Let’s implement a Gibbs sampler to sample from this joint PDF. First, we need to identify the following full conditionals from the above joint (perhaps, by purposefully organising variables and eyeballing).
\[\begin{align} f(\lambda_1|\lambda_2,\theta,\mathbf{x}) &\propto \lambda_1^{\sum_{i=1}^{m_1}x_i} e^{-(\theta+m_1)\lambda_1}, &&\text{i.e., } \lambda_1|\lambda_2,\theta,\mathbf{x} \sim \text{gamma}\Bigg(1+\sum_{i=1}^{m_1}x_i,\; \theta+m_1\Bigg)\\ f(\lambda_2|\lambda_1,\theta,\mathbf{x}) &\propto \lambda_2^{\sum_{j=1}^{m_2}x_j} e^{-(\theta+m_2)\lambda_2}, &&\text{i.e., } \lambda_2|\lambda_1,\theta,\mathbf{x} \sim \text{gamma}\Bigg(1+\sum_{j=1}^{m_2}x_j,\; \theta+m_2\Bigg)\\ f(\theta|\lambda_1,\lambda_2,\mathbf{x}) &\propto \theta^2 e^{-(1+\lambda_1+\lambda_2)\theta}, &&\text{i.e., } \theta|\lambda_1,\lambda_2,\mathbf{x} \sim \text{gamma}(3,\; 1+\lambda_1+\lambda_2), \end{align}\] where \(\text{gamma}(\alpha,\beta)\) is in the shape-rate parameterisation (as used by rgamma).
Then, you cycle through them for \(n\) rounds of sampling and return an \(n\times 3\) matrix.
sampler <- function(n, x1, x2) {
set.seed(90045)
m1 <- length(x1)
m2 <- length(x2)
s1 <- sum(x1)
s2 <- sum(x2)
D <- matrix(0, n, 3)
for (i in 2:n) {
D[i,1] <- rgamma(1, 1+s1, m1+D[i-1,3])
D[i,2] <- rgamma(1, 1+s2, m2+D[i-1,3])
D[i,3] <- rgamma(1, 3, 1+D[i,1]+D[i,2])
}
return(D)
}
For example,
x1 <- c(10,9,6,19,8,13,15,7,11)
x2 <- c(14,10,11,9,9,6,13,19,10,8,6,13,8,15,21,7,12,11)
n <- 1e6
D <- sampler(n, x1, x2)
burn_in <- floor(0.05*n) # throwing away 1st 5% of samples
D <- D[(burn_in+1):n,]
Using the samples D, you can approximate many quantities. For example,
\[\begin{align} \mu_1 &= \mathbb{E}[\lambda_1|\mathbf{x}]\\ \mu_2 &= \mathbb{E}[\lambda_2|\mathbf{x}]\\ \sigma^2_1 &= \mathbb{E}[(\lambda_1-\mu_1)^2|\mathbf{x}]\\ \sigma^2_2 &= \mathbb{E}[(\lambda_2-\mu_2)^2|\mathbf{x}]\\ \text{Cov}(\lambda_1,\lambda_2) &= \mathbb{E}[(\lambda_1-\mu_1)(\lambda_2-\mu_2)|\mathbf{x}]\\ \rho &= \frac{\text{Cov}(\lambda_1,\lambda_2)}{\sigma_1\sigma_2} \end{align}\]
MC estimates are
mu1 <- mean(D[,1])
mu2 <- mean(D[,2])
sig1 <- mean((D[,1]-mu1)^2)
sig2 <- mean((D[,2]-mu2)^2)
cov <- mean((D[,1]-mu1)*(D[,2]-mu2))
rho <- cov/sqrt(sig1*sig2)
\[\hat{\mu}_1 = 10.84,\; \hat{\mu}_2 = 11.2,\; \hat{\rho} = 0.0047.\] It seems the two bars have very similar average numbers of customers. As far as the correlation is concerned, however, they appear to be unrelated.